Cross-Cloud Plots: Scalable Tools for Spatial and Multidimensional Data Mining
Abstract
We focus on the problem of finding patterns across two large, multidimensional datasets. For example, given feature vectors of healthy and of non-healthy patients, we want to answer the following questions: “Are the two clouds of points separable?”, “What is the smallest/largest pair-wise distance across the two datasets?”, “Which of the two clouds does a new point (feature vector) come from?”. We propose a new tool, the ‘Cross-Cloud plot’, which helps us answer these questions and many more. We present an algorithm to compute the Cross-Cloud plot that requires only a single pass over the datasets and thus scales up to arbitrarily large databases. More importantly, it scales linearly with the dimensionality, whereas the cost of most other spatial data mining algorithms grows exponentially. We show how to use our tool for classification in cases where traditional methods (nearest neighbor, classification trees) may fail. We also provide a set of rules for interpreting a Cross-Cloud plot, and we apply these rules to multiple synthetic and real datasets.
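The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a single-pass computation with cost linear in the dimensionality could look. It assumes a grid-based (box-counting) scheme: each point is assigned to a cell at several cell sizes, and for each size the per-cell counts of the two clouds are multiplied and summed. The function names (`grid_counts`, `cross_cloud_curve`) and the choice of cell sizes are hypothetical, not taken from the paper.

```python
from collections import Counter
from math import floor


def grid_counts(points, sides):
    """Count points per grid cell for every cell side length.

    Assumption: a regular grid anchored at the origin. Each point is
    visited once; the work per point is proportional to
    len(sides) * dimensionality.
    """
    counts = {s: Counter() for s in sides}
    for p in points:  # single pass over the dataset
        for s in sides:
            cell = tuple(floor(x / s) for x in p)
            counts[s][cell] += 1
    return counts


def cross_cloud_curve(points_a, points_b, sides):
    """Return (side, cross_count) pairs for two point clouds.

    cross_count sums, over the cells occupied by both clouds, the
    product of the two per-cell counts. It is a grid-based proxy for
    the number of cross-cloud pairs that lie close together at that
    scale, not an exact pair count.
    """
    counts_a = grid_counts(points_a, sides)
    counts_b = grid_counts(points_b, sides)
    curve = []
    for s in sides:
        ca, cb = counts_a[s], counts_b[s]
        total = sum(n * cb[cell] for cell, n in ca.items() if cell in cb)
        curve.append((s, total))
    return curve


if __name__ == "__main__":
    cloud_a = [(0.1, 0.1), (0.9, 0.9)]
    cloud_b = [(0.2, 0.2), (3.5, 3.5)]
    for side, count in cross_cloud_curve(cloud_a, cloud_b, [1.0, 4.0]):
        print(side, count)
```

Under these assumptions, plotting the counts against the cell side on log-log axes would give a curve whose shape reflects how the two clouds are placed relative to each other. Whether this matches the plot defined in the paper is not something the abstract confirms.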
Similar resources
Assessment of uncertainty for coal quality-tonnage curves through minimum spatial cross-correlation simulation
Coal quality-tonnage curves are helpful tools in optimum mine planning and can be estimated using geostatistical simulation methods. In the presence of spatially cross-correlated variables, traditional co-simulation methods are impractical and time consuming. This paper investigates a factor simulation approach based on minimization of spatial cross-correlations with the objective of modeling s...
Rapid AkNN Query Processing for Fast Classification of Multidimensional Data in the Cloud
A k-nearest neighbor (kNN) query determines the k nearest points, using distance metrics, from a specific location. An all k-nearest neighbor (AkNN) query is a variation of a kNN query and retrieves the k nearest points for each point inside a database. They are mainly used in spatial databases, and they form the backbone of many location-based applications and not only (i.e. k...
Big Data Mining in the Cloud
Big Data is the growing challenge that organizations face as they deal with large and fast-growing sources of data or information that also present a complex range of analysis and use problems. Digital data production in many fields of human activity from science to enterprise is characterized by an exponential growth. Big data technologies will become a new generation of technologies and archi...
An Efficient Secret Sharing-based Storage System for Cloud-based Internet of Things
The Internet of Things (IoT) is a new information architecture based on the internet that develops interactions between objects and services in a secure and reliable environment. As the availability of smart devices rises, secure and scalable mass storage systems for aggregate data are required in IoT applications. In this paper, we propose a new method for storing aggregate data in Io...
Data Replication-Based Scheduling in Cloud Computing Environment
High-performance computing and vast storage are two key factors required for executing data-intensive applications. Compared with traditional distributed systems such as data grids, cloud computing provides these factors on a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for running such applications. Sometimes accessing data becomes...
Publication date: 2001